Abstract: The minimum energy, and, more generally,
the minimum cost, to transmit one bit of information
has been recently derived for bursty communication when
information is available infrequently at random times at
the transmitter. This result assumes that the receiver is
always in the listening mode and samples all channel
outputs until it makes a decision. If the receiver is
constrained to sample only a fraction p ∈ (0, 1] of the
channel outputs, what is the cost penalty due to sparse
output sampling?

Remarkably, there is no penalty: regardless of p > 0, the
asynchronous capacity per unit cost is the same as under
full sampling, i.e., when p = 1. Moreover, there is not even
a penalty in terms of decoding delay, that is, the elapsed
time between when information becomes available and when it
is decoded. This latter result relies on the possibility of
sampling adaptively: the next sample can be chosen as a
function of past samples. Under non-adaptive sampling, it is
possible to achieve the full-sampling asynchronous capacity
per unit cost, but the decoding delay gets multiplied by 1/p.
Therefore, adaptive sampling strategies are of particular
interest in the very sparse sampling regime.

Index Terms: Asynchronous communication; bursty
communication; capacity per unit cost; energy; error
exponents; hypothesis testing; sequential decoding; sensor
networks; sparse communication; sparse sampling;
synchronization

I. Introduction

In many emerging technologies, communication is sparse
and asynchronous, but it is essential that when data is
available, it is delivered to the destination as timely
and reliably as possible. Examples are sensor networks
monitoring rare but critical events, such as earthquakes,
forest fires, or epileptic seizures.

For such settings, [1] characterized the asynchronous
capacity per unit cost based on the following model.
There are B bits of information that are made available
to the transmitter at some random time ν, and need to be
communicated to the receiver. The B bits are coded and
transmitted over a memoryless channel using a sequence
of symbols that have costs associated with them. The rate
R per unit cost is the total number of bits divided by
the cost of the transmitted sequence. Asynchronism is
captured here by the fact that the random time ν is not
known a priori to the receiver. However, both transmitter
and receiver know that ν is distributed (e.g., uniformly)
over a time horizon [1, . . . , A]. At all times before and
after the actual transmission, the receiver observes "pure
noise." The noise distribution corresponds to a special
input "idle symbol" * being sent across the channel (for
example, in the case of a Gaussian channel, this would
be the 0, i.e., no transmit signal).

The goal of the receiver is to reliably decode the 
information bits by sequentially observing the outputs 
of the channel. 

A main result in [1] is a single-letter characterization
of the asynchronous capacity per unit cost C(β) where

$$\beta \overset{\text{def}}{=} \frac{\log A}{B}$$

denotes the timing uncertainty per information bit. While
this result holds for arbitrary discrete memoryless chan- 
nels and arbitrary input costs, the underlying model 
assumes that the receiver is always in the listening mode: 
every channel output is observed until decoding happens. 

What happens when the receiver is constrained to
observe a fraction 0 < p ≤ 1 of the channel outputs?
In this paper, it is shown that the asynchronous capacity
per unit cost is not impacted by sparse output sampling.
More specifically, the asynchronous capacity per unit
cost satisfies

$$C(\beta, p) = C(\beta, 1)$$

for any asynchronism level β > 0 and sampling
frequency 0 < p ≤ 1. Moreover, the decoding delay is
minimal: the elapsed time between when information 
starts being sent and when it is decoded is the same as 
under full sampling. This result uses the possibility for
the receiver to sample adaptively: the next sample can
be chosen as a function of past observed samples. In
fact, under non-adaptive sampling, it is still possible to
achieve the full sampling asynchronous capacity per unit
cost, but the decoding delay gets multiplied by a factor
1/p or (1 + p)/p depending on whether or not * can
be used for code design. Therefore, adaptive sampling
strategies are of particular interest in the very sparse
regime.

We end this section with a brief review of studies 
related to the above communication model. This model 
was introduced in 0. Both of these works focused 
mainly on the synchronization threshold — the largest 
level of asynchronism under which it is still possible to 
communicate reliably. In O, (H communication rate is 
defined with respect to the decoding delay, the expected 
elapsed time between when information is available and 
when it is decoded. Capacity upper and lower bounds are 
established and shown to be tight for certain channels. In 
El it is also shown that so-called training-based schemes, 
where synchronization and information transmission use 
separate degrees of freedom, need not be optimal in 
particular in the high rate regime. 

The finite message regime has been investigated by 
Polyanskiy in [3] when capacity is defined with respect
to the codeword length, i.e., same setting as [1] but with
unit cost per transmitted symbol. A main result in [3] is
that dispersion — a fundamental quantity that relates rate 
and error probability in the finite block length regime — 
is unaffected by the lack of synchronization. Whether or 
not this remains true under sparse output sampling is an 
interesting open issue. 

Note that the seemingly similar notions of rates inves- 
tigated in 0, [|4l and ID, ||5l are in fact very different. In 
particular, capacity with respect to the expected decoding 
delay remains in general an open problem. 

A "slotted" version of the above communication
model was considered in [6] by Wang, Chandar, and
Wornell, where communication can happen only in
one of consecutive slots of the size of a codeword. For
this model, the authors investigated the tradeoff between
the false-alarm event (the decoder declares a message
before it is even sent) and the miss event (the decoder
misses the sent codeword).

The previous works consider point-to-point communication.
A (diamond) network configuration was recently
investigated by Shomorony, Etkin, Parvaresh, and
Avestimehr in [7], who provided bounds on the minimum
energy needed to convey one bit of information across
the network.

In the above models, although communication is bursty,
information transmission is contiguous since it always
lasts the codeword duration. A complementary setup
proposed by Khoshnevisan and Laneman [8] considers a
bursty communication scenario caused by an intermittent
codeword transmission. This model can be seen as a
slotted variation of the purely insertion channel model,
the latter being a particular case of the general
insertion, deletion, and substitution channel introduced by
Dobrushin [9].

This paper is organized as follows. Section II contains
some background material and extends the model
developed in [1] to allow for sparse output sampling.
Section III contains the main results and briefly discusses
extensions to a decoder-universal setting and to a
multiple access setup. Finally, Section IV is devoted to the
proofs.

II. Model and Performance Criterion 

The asynchronous communication model we consider 
captures the following general features: 

• Information is available at the transmitter at a 
random time; 

• The transmitter can choose when to start sending 
information based on when information is available 
and based on what message needs to be transmitted; 

• There is a cost associated with each channel input;

• Outside the information transmission period the
transmitter stays idle and the receiver observes
noise;

• The decoder is sampling constrained and can observe
only a fraction of the channel outputs;

• Without knowing a priori when information is avail- 
able, the decoder should decode reliably and as 
early as possible, on a sequential basis. 

The model is now specified. Communication is
discrete-time and carried over a discrete memoryless
channel characterized by its finite input and output
alphabets

$$\mathcal{X} \cup \{\star\} \quad \text{and} \quad \mathcal{Y},$$

respectively, and transition probability matrix

$$Q(y|x),$$

for all y ∈ 𝒴 and x ∈ 𝒳 ∪ {*}. The alphabet 𝒳 may or
may not include *. Without loss of generality, we assume
that for all y ∈ 𝒴 there is some x ∈ 𝒳 ∪ {*} for which

$$Q(y|x) > 0.$$

Given B ≥ 1 information bits to be transmitted, a
codebook 𝒞 consists of

$$M = 2^B$$

codewords of length n ≥ 1 composed of symbols from
𝒳.

A randomly and uniformly chosen message m arrives
at the transmitter at a random time ν, independent of
m, and uniformly distributed over [1, . . . , A], where the
integer

$$A = 2^{\beta B}$$

characterizes the asynchronism level between the
transmitter and the receiver, and where the constant

$$\beta \geq 0$$

denotes the timing uncertainty per information bit, see
Fig. 1.

We consider one-shot communication, i.e., only one
message arrives over the period [1, 2, . . . , A]. If A = 1,
the channel is said to be synchronous.

Given ν and m, the transmitter chooses a time σ(ν, m)
to start sending codeword c^n(m) ∈ 𝒞 assigned to
message m. Transmission cannot start before the message
arrives or after the end of the uncertainty window, hence
σ(ν, m) must satisfy

$$\nu \leq \sigma(\nu, m) \leq A \quad \text{almost surely.}$$

In the rest of the paper, we suppress the arguments ν and
m of σ when these arguments are clear from context.

Before and after the codeword transmission, i.e.,
before time σ and after time σ + n − 1, the receiver
observes "pure noise." Specifically, conditioned on the
event {ν = t}, t ∈ {1, . . . , A}, and on the message to be
conveyed m, the receiver observes independent channel
outputs

$$Y_1, Y_2, \ldots, Y_{A+n-1}$$

distributed as follows. For

$$1 \leq i \leq \sigma(t, m) - 1$$

or

$$\sigma(t, m) + n \leq i \leq A + n - 1,$$

the Y_i's are "pure noise" symbols, i.e.,

$$Y_i \sim Q(\cdot|\star).$$

For σ ≤ i ≤ σ + n − 1,

$$Y_i \sim Q(\cdot|c_{i-\sigma+1}(m)),$$

where c_i(m) denotes the ith symbol of the codeword
c^n(m).
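The output model just described can be illustrated with a short simulation. The following Python sketch is only an example under stated assumptions: the binary channel matrix `Q`, the horizon `A`, the codeword, and the choice σ = ν are all hypothetical values picked for illustration and do not come from the paper.

```python
import random

IDLE = "*"  # representation of the idle symbol (an assumption of this sketch)

# Hypothetical channel: Q[x][y] is the probability of output y given input x.
Q = {
    IDLE: {0: 0.9, 1: 0.1},  # pure-noise distribution Q(.|*)
    0: {0: 0.8, 1: 0.2},
    1: {0: 0.2, 1: 0.8},
}

def channel_inputs(codeword, sigma, A):
    """Inputs at times 1..A+n-1: idle outside [sigma, sigma+n-1], codeword inside."""
    n = len(codeword)
    assert 1 <= sigma <= A
    inputs = [IDLE] * (A + n - 1)
    for i, symbol in enumerate(codeword):
        inputs[sigma - 1 + i] = symbol  # time sigma+i, stored 0-based
    return inputs

def channel_outputs(inputs, rng):
    """Draw one independent output per input symbol according to Q."""
    outputs = []
    for x in inputs:
        ys = list(Q[x].keys())
        ws = list(Q[x].values())
        outputs.append(rng.choices(ys, weights=ws)[0])
    return outputs

rng = random.Random(0)
A = 16
codeword = [1, 1, 0, 1]
nu = rng.randint(1, A)  # arrival time, uniform over {1, ..., A}
sigma = nu              # simplest admissible start time: nu <= sigma <= A
x = channel_inputs(codeword, sigma, A)
y = channel_outputs(x, rng)
```

Here `x` is the full input sequence (idle symbols around the codeword) and `y` the corresponding output sequence of length A + n − 1 from which the receiver would draw its samples.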

The receiver operates according to a sampling strategy
and a sequential decoder. A sampling strategy consists
of "sampling times" which are defined as an ordered
collection of random time indices

$$\mathcal{S} = \{(S_1, \ldots, S_\ell) \subseteq \{1, \ldots, A+n-1\} : S_i < S_j,\ i < j\}$$

where S_j is interpreted as the jth sampling time.

The sampling strategy is either non-adaptive or adaptive.
It is non-adaptive when the sampling times given
by 𝒮 are all known before communication starts, hence
𝒮 is independent of Y_1^{A+n−1}. The strategy is adaptive
when the sampling times are functions of past observations.
This means that S_1 is an arbitrary value in
{1, . . . , A + n − 1}, possibly random but independent
of Y_1^{A+n−1} and, for j ≥ 2,

$$S_j = g_j(\{Y_{S_i}\}_{i<j})$$

for some (possibly randomized) function

$$g_j : \mathcal{Y}^{j-1} \to \{S_{j-1}+1, \ldots, A+n-1\}.$$

Notice that ℓ, the total number of output samples, may
be random under adaptive sampling, but also under
non-adaptive sampling, since the strategy may be randomized
(but still independent of the channel outputs Y_1^{A+n−1}).

Fig. 1. Time representation of what is sent (upper arrow) and
what is received (lower arrow). The "*" represents the "idle" symbol.
Message m arrives at time ν and starts being sent at time σ. The
receiver samples at the (random) times S_1, S_2, . . . and decodes at
time S_τ based on τ output samples.

Once the sampling strategy is fixed, the receiver
decodes by means of a sequential test (τ, φ), where τ,
the decision time, is a stopping time with respect to the
sampled sequence

$$Y_{S_1}, Y_{S_2}, \ldots$$

indicating when decoding happens,¹ and where φ is the
decoding function, i.e., a map

$$\phi : \mathcal{O} \to \{1, 2, \ldots, M\}$$

where

$$\mathcal{O} = \{Y_{S_1}, Y_{S_2}, \ldots, Y_{S_\tau}\}$$

is the set of observed samples. Hence, decoding happens
at time S_τ on the basis of τ output samples. Since there
are at most A + n − 1 sampling times, τ is bounded by
A + n − 1.

A code (𝒞, 𝒮, (τ, φ)) is defined as a codebook, a
receiver sampling strategy, and a decoder (decision time
and decoding function). Throughout the paper, whenever
clear from context, we often refer to a code using
the codebook symbol 𝒞 only, leaving out an explicit
reference to the sampling strategy and to the decoder.

1 Recall that a (deterministic or randomized) stopping time τ
with respect to a sequence of random variables Y_1, Y_2, . . . is a
positive, integer-valued random variable such that the event {τ = t},
conditioned on the realization of Y_1, Y_2, . . . , Y_t, is independent of the
realization of Y_{t+1}, Y_{t+2}, . . ., for all t ≥ 1.

Definition 1 (Error probability). The maximum (over
messages) decoding error probability of a code 𝒞 is
defined as

$$P(\mathcal{E}|\mathcal{C}) = \max_m \frac{1}{A} \sum_{t=1}^{A} P_{m,t}(\mathcal{E}_m), \tag{1}$$

where the subscripts "m, t" denote conditioning on the
event that message m arrives at time ν = t, and where
ℰ_m denotes the error event that the decoded message
does not correspond to m, i.e.,

$$\mathcal{E}_m = \{\phi(\mathcal{O}) \neq m\}.$$

Definition 2 (Cost of a Code). The (maximum) cost of
a code 𝒞 with respect to a cost function k : 𝒳 → [0, ∞]
is defined as

$$K(\mathcal{C}) = \max_m \sum_{i=1}^{n} k(c_i(m)).$$

Assumption: throughout the paper we make the assumption
that the only possible zero cost symbol is *. When
* ∈ 𝒳 the transmitter can stay idle at no cost. When
* ∉ 𝒳 then k(x) > 0 for any x ∈ 𝒳, which captures the
situation where a "standby" mode may not be possible
at zero cost. The other cases, investigated in [1] under
full sampling, are either trivial (when 𝒳 contains two
or more zero cost symbols) or arguably unnatural (𝒳
contains a zero cost symbol that differs from *, or
* ∈ 𝒳 and 𝒳 contains only nonzero cost symbols).

Below, P_m denotes the output distribution conditioned
on the sending of message m. Hence, by definition we
have

$$P_m(\cdot) = \frac{1}{A} \sum_{t=1}^{A} P_{m,t}(\cdot).$$

Definition 3 (Sampling Frequency of a Code). Given
ε > 0, the sampling frequency of a code 𝒞, denoted by
p(𝒞, ε), is the relative number of channel outputs that
are observed until a message is declared. Specifically, it
is defined as the smallest r ≥ 0 such that

$$\min_m P_m(\tau / S_\tau \leq r) \geq 1 - \varepsilon.$$

(Recall that S_τ refers to the last sampling time.)

Definition 4 (Delay of a Code). Given ε > 0, the
(maximum) delay of a code 𝒞, denoted by d(𝒞, ε), is
defined as the smallest integer l such that



We now define capacity per unit cost under the constraint
that the receiver has access only to a limited
number of channel outputs:

Definition 5 (Asynchronous Capacity per Unit Cost
under Sampling Constraint). R is an achievable rate
per unit cost at timing uncertainty per information bit
β and sampling frequency p, if there exists a sequence
of codes {𝒞_B} and a sequence of positive numbers ε_B
with ε_B → 0 as B → ∞ such that for all B large enough

1) 𝒞_B operates at timing uncertainty per information
bit β;

2) the maximum error probability P(ℰ|𝒞_B) is at most
ε_B;

3) the rate per unit cost

$$\frac{B}{K(\mathcal{C}_B)}$$

is at least R − ε_B;

4) the sampling frequency satisfies p(𝒞_B, ε_B) ≤ p + ε_B;

5) the delay satisfies²

$$\frac{1}{B} \log(d(\mathcal{C}_B, \varepsilon_B)) \leq \varepsilon_B.$$

Notice that the last requirement asks for a subexponential
delay.

The asynchronous capacity per unit cost, denoted by
C(β, p), is the supremum of achievable rates per unit
cost.

Two basic observations:

• C(β, p) is a non-increasing function of β for fixed p;

• C(β, p) is a non-decreasing function of p for fixed β.

In particular, for any fixed β ≥ 0

$$\max_{p > 0} C(\beta, p) = C(\beta, 1).$$

2 Throughout the paper log is always to the base 2.

Capacity per unit cost under full sampling C(β, 1) is
characterized in the following theorem:

Theorem 1 ([1, Theorem 1]). For any β ≥ 0

$$C(\beta, 1) = \max_X \min\left\{ \frac{I(X;Y)}{E[k(X)]},\ \frac{I(X;Y) + D(Y\|Y_\star)}{E[k(X)](1+\beta)} \right\}, \tag{2}$$

where max_X denotes maximization with respect to
the channel input distribution P_X, where (X, Y) ~
P_X(·)Q(·|·), where Y_* denotes the random output of
the channel when the idle symbol * is transmitted (i.e.,
Y_* ~ Q(·|*)), where I(X;Y) denotes the mutual
information between X and Y, and where D(Y‖Y_*) denotes
the divergence (Kullback-Leibler distance) between the
distributions of Y and Y_*.³
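To make formula (2) concrete, the following Python sketch evaluates it numerically by a grid search over binary input distributions. It is a minimal illustration under assumptions that are not in the paper: the channel matrix `Q`, the cost vector, and the convention that input 0 plays the role of the zero-cost idle symbol (the case * ∈ 𝒳) are all hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def capacity_full_sampling(Q, idle_row, cost, beta, grid=1000):
    """Grid-search evaluation of (2) for a binary-input, binary-output channel."""
    best = 0.0
    for j in range(1, grid + 1):
        a = j / grid                       # P_X(1) = a, P_X(0) = 1 - a
        px = [1.0 - a, a]
        py = [px[0] * Q[0][y] + px[1] * Q[1][y] for y in range(2)]
        mi = sum(px[x] * kl(Q[x], py) for x in range(2))   # I(X;Y)
        div = kl(py, idle_row)                              # D(Y || Y_*)
        ek = px[0] * cost[0] + px[1] * cost[1]              # E[k(X)]
        if ek <= 0:
            continue
        best = max(best, min(mi / ek, (mi + div) / (ek * (1 + beta))))
    return best

# Hypothetical channel: row 0 is the idle symbol (zero cost), row 1 costs one unit.
Q = [[0.9, 0.1], [0.1, 0.9]]
cost = [0.0, 1.0]
c_sync = capacity_full_sampling(Q, Q[0], cost, beta=0.0)
c_async = capacity_full_sampling(Q, Q[0], cost, beta=2.0)
```

Since each term inside the maximum is non-increasing in β, the value for β = 2 cannot exceed the value for β = 0, which agrees with the monotonicity of C(β, p) in β noted above.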

Let P_{X*} be a capacity per unit cost achieving input
distribution, i.e., X* achieves the maximum in (2). As
shown in the converse of the proof of [1, Theorem 1],
codes that achieve the capacity per unit cost can be
restricted to codes of (asymptotically) constant
composition P_{X*}. Specifically, we have

$$\frac{B}{n_B(P_{X^*})\, E[k(X^*)]} = C(\beta, 1)(1 - o(1)) \quad (B \to \infty),$$

where n_B(P_{X*}) denotes the length of the P_{X*}-constant
composition codes achieving C(β, 1). Now define

$$n_B^* \overset{\text{def}}{=} \min_{P_{X^*}} n_B(P_{X^*}) = \min_{X \in \mathcal{P}} \frac{B}{C(\beta, 1)\, E[k(X)]}$$

where

$$\mathcal{P} \overset{\text{def}}{=} \{X : X \text{ achieves the maximum in (2)}\}.$$

From the achievability and converse of [1, Theorem 1],
{n*_B} represent the smallest achievable delays for codes
{𝒞_B} achieving the asynchronous capacity per unit cost
under full sampling C(β, 1) in the sense that

$$d(\mathcal{C}_B, \varepsilon_B) \geq n_B^*(1 - o(1)) \quad (B \to \infty)$$

for any ε_B → 0 as B → ∞.

Our results, stated in the next section, say that the
capacity per unit cost under sampling frequency 0 <
p < 1 is the same as under full sampling, i.e., p =
1. To achieve this, non-adaptive sampling is sufficient.
However, if we also want to achieve minimum delay,
then adaptive sampling is necessary. In fact, non-adaptive
sampling strategies that achieve capacity per unit cost
have a delay that grows at least as

$$\frac{n_B^*}{p} \quad \text{or} \quad \frac{n_B^*(1+p)}{p},$$

depending on whether or not * ∈ 𝒳.

3 Y_* can be interpreted as "pure noise."

We end this section with a few notational conventions.
We use 𝒫^𝒳 to denote the set of distributions over the
finite alphabet 𝒳. Recall that the type of a string x^n ∈
𝒳^n, denoted by P_{x^n}, is the probability distribution over
𝒳 that assigns to each a ∈ 𝒳 the number of occurrences of
a within x^n divided by n [10, Chapter 1.2]. For instance,
if x^3 = (0, 1, 0), then P_{x^3}(0) = 2/3 and P_{x^3}(1) = 1/3.
The joint type P_{x^n,y^n} induced by a pair of strings
x^n ∈ 𝒳^n, y^n ∈ 𝒴^n is defined similarly. The set of strings
of length n that have type P is denoted by 𝒯_P. The set of
all types over 𝒳 of strings of length n is denoted by
𝒫_n^𝒳. Finally, we use poly(·) to denote a function that
does not grow or decay faster than polynomially in its
argument.

Throughout the paper we use the standard "big-O"
Landau notation to characterize growth rates (see, e.g.,
[11, Chapter 3]).
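The notion of type recalled above is simply an empirical distribution. The short Python sketch below computes types and joint types with exact fractions; the function names `type_of` and `joint_type` are illustrative choices, not notation from the paper.

```python
from collections import Counter
from fractions import Fraction

def type_of(string, alphabet):
    """Type of a string: relative frequency of each alphabet symbol."""
    counts = Counter(string)
    n = len(string)
    return {a: Fraction(counts[a], n) for a in alphabet}

def joint_type(xs, ys):
    """Joint type of two equal-length strings: relative frequency of each pair."""
    assert len(xs) == len(ys)
    n = len(xs)
    counts = Counter(zip(xs, ys))
    return {pair: Fraction(c, n) for pair, c in counts.items()}

# The example from the text: x^3 = (0, 1, 0).
example = type_of((0, 1, 0), alphabet=(0, 1))
```

For the string (0, 1, 0) this yields the values 2/3 and 1/3 quoted in the text.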

III. Results 

In the sequel we denote by C_a(β, p) and C_na(β, p)
the capacity per unit cost when restricted to adaptive
and non-adaptive sampling, respectively.

Our first result characterizes the capacity per unit cost 
under non-adaptive sampling. 

Theorem 2 (Non-adaptive sampling). Under non-adaptive
sampling it is possible to achieve the full-sampling
capacity per unit cost, i.e.,

$$C_{na}(\beta, p) = C(\beta, 1) \quad \text{for any } \beta > 0,\ p > 0.$$

Furthermore, codes {𝒞_B} that achieve rate γC(β, 1),
0 < γ < 1, satisfy

$$\lim_{\gamma \to 1} \liminf_{B \to \infty} \frac{d(\mathcal{C}_B, \varepsilon_B)}{n_B^*} \geq \frac{1}{p}$$

when * ∈ 𝒳, and satisfy

$$\lim_{\gamma \to 1} \liminf_{B \to \infty} \frac{d(\mathcal{C}_B, \varepsilon_B)}{n_B^*} \geq \frac{1+p}{p}$$

when * ∉ 𝒳. Finally, the above delay bounds are tight:
for any ε > 0 and γ close enough to 1 there exist {𝒞_B}
and ε_B → 0 as B → ∞ such that

$$\liminf_{B \to \infty} \frac{d(\mathcal{C}_B, \varepsilon_B)}{n_B^*} \leq \frac{1}{p} + \varepsilon$$

for the case * ∈ 𝒳, and similarly for the case * ∉ 𝒳.

Hence, even with a negligible fraction of the channel
outputs it is possible to achieve the full-sampling
capacity per unit cost. However, this comes at the expense
of delay, which gets multiplied by a factor 1/p or
(1 + p)/p depending on whether or not * can be used for
code design. This disadvantage is overcome by adaptive
sampling:

Theorem 3 (Adaptive sampling). Under adaptive sampling
it is possible to achieve the full-sampling capacity
per unit cost, i.e.,

$$C_a(\beta, p) = C(\beta, 1) \quad \text{for any } \beta > 0,\ p > 0.$$

Moreover, there exist {𝒞_B} and ε_B → 0 as B → ∞
such that

$$d(\mathcal{C}_B, \varepsilon_B) = n_B^*(1 + o(1)).$$

The first part of Theorem 3 immediately follows from
the first part of Theorem 2 since the set of adaptive
sampling strategies includes the set of non-adaptive
sampling strategies. The interesting part of Theorem 3 is that
adaptive sampling strategies guarantee minimal delay
regardless of the sampling rate p, as long as it is
nonzero.

What is an optimal adaptive sampling strategy?
Intuitively, such a strategy should sample sparsely, with
a sampling frequency of no more than p, under pure
noise, for otherwise the sampling constraint is violated.
It should also sample the entire sent codeword, and so
densely sample during message transmission, for otherwise
a rate per unit cost penalty is incurred. The main
characteristic of a good adaptive sampling strategy is the
criterion under which the sampling mode switches from
sparse to dense. If the criterion is too conservative, i.e., if
the probability of switching under pure noise is too low,
we might sample only part of the codeword, thereby
incurring a cost loss. By contrast, if this probability is too
high, we might not be able to accommodate the desired
sampling frequency.

The proposed asymptotically optimal
sampling/decoding strategy operates as follows
(details are deferred to the proof of Theorem 3).

The strategy starts in the sparse mode, taking samples
at times S_j = ⌈j/p⌉, j = 1, 2, . . .. At each S_j, the
receiver computes the empirical distribution (or type) of
the last log(n) samples. If the probability of observing
this type under pure noise is greater than 1/n², the mode
is kept unchanged and we repeat this test at the next
round j + 1. Instead, if it is smaller than 1/n², then
we switch to the dense sampling mode, taking samples
continuously for at most n time steps. At each of these
steps the receiver applies a standard typicality decoding
based on the past n output samples. If no codeword is
typical with the channel outputs after these n time steps,
sampling is switched back to the sparse mode. As it
turns out, the threshold 1/n² can be replaced by any
decreasing function of n that decreases at least as fast
as 1/n² but not faster than polynomially in n.
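The sparse-to-dense switching rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's construction: the noise distribution `noise` is hypothetical, the helper names are invented for the example, and "probability of observing this type under pure noise" is interpreted here as the probability of the whole type class under i.i.d. noise.

```python
import math

def sparse_sampling_times(p, horizon):
    """Sparse-mode sampling times S_j = ceil(j / p), j = 1, 2, ..., up to the horizon."""
    times = []
    j = 1
    while math.ceil(j / p) <= horizon:
        times.append(math.ceil(j / p))
        j += 1
    return times

def type_class_probability(window, noise):
    """Probability, under i.i.d. noise, that a window of this length has the observed type."""
    counts = {y: 0 for y in noise}
    for y in window:
        counts[y] += 1
    coeff = math.factorial(len(window))   # becomes the multinomial coefficient
    prob = 1.0
    for y, c in counts.items():
        coeff //= math.factorial(c)
        prob *= noise[y] ** c
    return coeff * prob

def should_switch_to_dense(window, noise, n):
    """Switch when the observed type is unlikely under pure noise (threshold 1/n^2)."""
    return type_class_probability(window, noise) < 1.0 / n ** 2

noise = {0: 0.9, 1: 0.1}   # hypothetical pure-noise output distribution Q(.|*)
n = 1024                   # codeword length, so the test window has log2(n) = 10 samples
```

With these assumed values, a window of ten ones is very unlikely under noise and triggers the dense mode, whereas a window of ten zeros does not.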

We end this section by considering the specific case
when β = 0, i.e., when the channel is synchronous. For
a given sampling frequency p, the receiver gets to see
only a fraction p of the transmitted codeword (whether
sampling is adaptive or non-adaptive) and hence

$$C(0, p) = p\,C(0, 1)$$

for any p > 0.

How is it possible that sparse output sampling induces
a rate per unit cost loss for synchronous communication
(β = 0), but not for asynchronous communication
(β > 0) as we saw in Theorems 2 and 3? The reason
for this is that when β > 0, the level of asynchronism
is exponential in B. Therefore, even if the receiver is
constrained to sample only a fraction p of the channel
outputs, it may still occasionally sample fully over,
say, Θ(B) channel outputs, and still satisfy the overall
constraint that the fraction shouldn't exceed p.⁴

Remark 1. Theorems 2 and 3 remain valid under
universal decoding, i.e., the only element from the channel
that the decoder needs to know is its output alphabet 𝒴.
This is briefly discussed at the end of Section IV.

Remark 2. Consider a multiple access generalization of
the point-to-point setting where, instead of one
transmitter, there are

$$U = 2^{vB}$$

transmitters who communicate to a common receiver,
where v, 0 ≤ v < β, denotes the occupation
parameter of the channel. The message arrival times
{ν_1, ν_2, . . . , ν_U} at the transmitters are jointly
independent and uniformly distributed over [1, . . . , A] with
A = 2^{βB} as before. Communication takes place as
in the previous point-to-point case, each user uses the
same codebook, and transmissions start at the times
{σ_1, σ_2, . . . , σ_U}. Whenever a user tries to access the
channel while it is occupied, the channel outputs random
symbols, independent of the input (collision model).
The receiver operates sequentially and declares U 
messages at the times 

where the stopping time τ_i, 1 ≤ i ≤ U, is with respect to
the output samples 

It is easy to check (say, from the Birthday problem 
MTS) that if 

$$v < \beta/2$$

and hence U = o(√A) = o(2^{βB/2}), the collision
probability goes to zero as B → ∞. Hence in the regime
of large message size, the transmitters are (essentially)
operating orthogonally, and each user can achieve the
point-to-point capacity per unit cost assuming a per-user
error probability. We may refer to this regime as the
regime of "sparse transmissions," relevant in a sensor
network monitoring independent rare events.

4 If over a long trip we have a high-mileage drive, we can still push
the car a few times without impacting the overall mileage.
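The birthday-problem argument of Remark 2 can be checked numerically with a simple union bound. The Python sketch below is a hedged illustration: it assumes that each transmission starts at its arrival time, that two length-n transmissions collide when their start times differ by less than n, and that n grows linearly with B; the function name `collision_bound` and the specific parameter values are hypothetical.

```python
def collision_bound(v, beta, B, n):
    """Union bound on the probability that any two of the U transmissions overlap.

    Assumes start times uniform over {1, ..., A}; two codewords of length n
    overlap only if their start times differ by at most n - 1, which has
    probability at most (2n - 1) / A for each pair.
    """
    U = 2 ** (v * B)          # number of transmitters
    A = 2 ** (beta * B)       # asynchronism level
    pairs = U * (U - 1) / 2
    return min(1.0, pairs * (2 * n - 1) / A)

# v < beta / 2: the bound shrinks as B grows (here n = B is an assumed scaling).
small_B = collision_bound(v=0.25, beta=1.0, B=20, n=20)
large_B = collision_bound(v=0.25, beta=1.0, B=40, n=40)
# v > beta / 2: the bound is vacuous.
crowded = collision_bound(v=0.75, beta=1.0, B=20, n=20)
```

The bound behaves like U²n/A = 2^{(2v−β)B}·n, which vanishes exactly when v < β/2, in line with the condition stated in the remark.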

Note that since the users use the same codebook, the 
receiver does not know which transmitter conveys what 
information. The receiver can only recognize the set of 
transmitted messages. 

If the receiver is also required to identify the messages
and their transmitters, then each transmitter effectively
conveys B(1 + v) information bits and the capacity per
unit cost gets multiplied by 1/(1 + v).

IV. Analysis 

The following two standard type results are often used
in our analysis.

Fact 1 ([10, Lemma 1.2.2]).

$$|\mathcal{P}_n^{\mathcal{X}}| = \text{poly}(n).$$

Fact 2 ([10, Lemma 1.2.6]). If X^n is independent and
identically distributed (i.i.d.) according to P_1 ∈ 𝒫^𝒳, then

$$\text{poly}(n)\, e^{-nD(P_2\|P_1)} \leq P(X^n \in \mathcal{T}_{P_2}) \leq e^{-nD(P_2\|P_1)}$$

for any P_2 ∈ 𝒫_n^𝒳.

Achievability of Theorem 2: Fix some arbitrary
distribution P on 𝒳. Let X be the input having that
distribution and let Y be the corresponding output, i.e.,
(X, Y) ~ P(·)Q(·|·).

Given B bits of information to be transmitted, the
codebook 𝒞 is randomly generated as follows. For each
message m = 1, . . . , M, randomly generate length n
sequences x^n i.i.d. according to P, until x^n belongs to
the "constant composition" set⁵

$$\mathcal{A}_n = \{x^n : \|P_{x^n} - P\| \leq 1/\log n\}. \tag{3}$$

If (3) is satisfied, then let c^n(m) = x^n and move
to the next message. Stop when a codeword has been
assigned to all messages. From Chebyshev's inequality,
for any fixed m, no repetition will be required with high
probability to generate c^n(m), i.e.,

$$P^n(\mathcal{A}_n) \to 1 \quad \text{as } n \to \infty \tag{4}$$

where P^n denotes the order n product distribution of P.

The obtained codewords are thus essentially of constant
composition (i.e., each symbol appears roughly the
same number of times) and have cost nE[k(X)](1 +
o(1)) as n → ∞, where k(·) is the input cost function
of the channel.
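The random codebook construction above amounts to rejection sampling against the set in (3). The following Python sketch shows one way this could look; the input distribution, the values of n and M, and the helper names are illustrative assumptions, and the logarithm is taken to base 2 as in the rest of the paper.

```python
import math
import random
from collections import Counter

def l1_distance_to(P, xs):
    """L1 distance between the type of the string xs and the distribution P."""
    n = len(xs)
    counts = Counter(xs)
    return sum(abs(counts[a] / n - P[a]) for a in P)

def generate_codebook(P, n, M, rng):
    """Draw i.i.d. sequences from P and keep only those in the set A_n of (3)."""
    symbols = list(P)
    weights = [P[a] for a in symbols]
    tol = 1.0 / math.log2(n)
    codebook = []
    while len(codebook) < M:
        xs = rng.choices(symbols, weights=weights, k=n)
        if l1_distance_to(P, xs) <= tol:   # membership test for A_n
            codebook.append(xs)
    return codebook

# Hypothetical parameters chosen only for illustration.
P = {0: 0.5, 1: 0.5}
codebook = generate_codebook(P, n=256, M=8, rng=random.Random(1))
```

By (4), the acceptance probability tends to one as n grows, so the loop rarely needs to redraw a sequence for large n.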

5 ‖·‖ refers to the L_1-norm.

Case * ∈ 𝒳: Information transmission is as follows.
For simplicity let us first assume that 1/p is an integer.
Codeword symbols can be transmitted only at multiples
of 1/p. Times that are integer multiples of 1/p from
now on are referred to as transmission times. Given a
message m available at time ν, the transmitter sends
the corresponding codeword c^n(m) during the first n
information transmission times coming at time ≥ ν.
In between transmission times the transmitter sends *.
Hence, the transmitter sends

$$c_1(m) \star \ldots \star c_2(m) \star \ldots \star c_3(m) \ldots c_n(m)$$

starting at time σ = σ(ν), the first transmission time at or
after ν, i.e.,

$$\sigma = \sigma(\nu) = \min\{j/p : j \in \mathbb{N},\ j/p \geq \nu\}.$$
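The transmission schedule for the case * ∈ 𝒳 can be written out explicitly. The Python sketch below assumes, as in the text, that 1/p is an integer (called `period` here); the function names and the use of "*" as a string for the idle symbol are choices made for this illustration.

```python
IDLE = "*"  # representation of the idle symbol (an assumption of this sketch)

def first_transmission_time(nu, period):
    """Smallest integer multiple of `period` (= 1/p) that is >= nu."""
    return -(-nu // period) * period   # ceiling division with integers

def transmitted_block(codeword, period):
    """Codeword symbols spaced `period` apart, with idle symbols in between."""
    block = []
    for i, symbol in enumerate(codeword):
        block.append(symbol)
        if i < len(codeword) - 1:
            block.extend([IDLE] * (period - 1))
    return block

# Example with p = 1/4: a message arriving at time 5 starts being sent at time 8.
sigma = first_transmission_time(nu=5, period=4)
block = transmitted_block([1, 0, 1], period=4)
```

The start time always satisfies ν ≤ σ < ν + 1/p, which is the property used later when bounding the delay of this scheme.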

The receiver operates as follows. Sampling is performed
only at the transmission times. At transmission
time t, the decoder computes the empirical distributions

$$\hat{P}_{c^n(m), y^n}(\cdot, \cdot)$$

induced by the last n output samples y^n and all the
codewords {c^n(m)}. If there is a unique message m for
which

$$\|\hat{P}_{c^n(m), y^n}(\cdot, \cdot) - P(\cdot)Q(\cdot|\cdot)\| \leq 2/\log n,$$

the decoder stops and declares that message m was
sent. If two (or more) codewords c^n(m) and c^n(m')
relative to two different messages m and m' are typical
with y^n, the decoder stops and declares one of the
corresponding messages at random. If no codeword is
typical with y^n, the decoder repeats the procedure at
the next transmission time. If by the time of the last
transmission time no message has been declared, the
decoder outputs a random message.

We first compute the error probability averaged over
codebooks and messages. Suppose message m is transmitted.
The error event that the decoder declares some
specific message m' ≠ m can be decomposed as⁶

$$\{m \to m'\} = \mathcal{E}_1 \cup \mathcal{E}_2, \tag{5}$$

where the error events ℰ_1 and ℰ_2 are defined as

• ℰ_1: the decoder stops at a time t between σ and
σ + (2n − 2)/p (including σ and σ + (2n − 2)/p)
and declares m';

• ℰ_2: the decoder stops either at a time t before time
σ or from time σ + (2n − 1)/p onwards and declares
m'.

Note that when event ℰ_1 happens, the observed sequence
is generated by the sent codeword. By contrast, when
event ℰ_2 happens, then the observed sequence is
generated only by pure noise.

Using analogous arguments as in the achievability of
[1, Proof of Theorem 1] we obtain the upper bounds

$$P_m(\mathcal{E}_1) \leq 2^{-n(I(X;Y) - \varepsilon)}$$

and

$$P_m(\mathcal{E}_2) \leq A \cdot 2^{-n(I(X;Y) + D(Y\|Y_\star) - \varepsilon)},$$

which are both valid for any fixed ε > 0 provided that
n is large enough.
Combining, we get

$$P_m(m \to m') \leq 2^{-n(I(X;Y) - \varepsilon)} + A \cdot 2^{-n(I(X;Y) + D(Y\|Y_\star) - \varepsilon)}.$$

Hence, taking a union bound over all possible wrong
messages, we obtain that for all ε > 0,

$$P(\mathcal{E}) \leq 2^B \left( 2^{-n(I(X;Y) - \varepsilon)} + A \cdot 2^{-n(I(X;Y) + D(Y\|Y_\star) - \varepsilon)} \right) \overset{\text{def}}{=} \varepsilon_1(n) \tag{6}$$

for n large enough.

6 Notice that the decoder outputs a message with probability one.

We now show that the delay of our coding scheme
in the sense of Definition 4 is at most n/p. Suppose a
specific (non-random) codeword c^n(m) ∈ 𝒜_n is sent. If

$$\tau > \sigma + (n-1)/p,$$

then necessarily c^n(m) is not typical with the n output
samples observed at the transmission times
σ, σ + 1/p, . . . , σ + (n − 1)/p.
By Sanov's theorem this happens with vanishing
probability and hence

$$P(\tau - \sigma \leq (n-1)/p) = 1 - \varepsilon_2(n)$$

with ε_2(n) → 0 as n → ∞. Hence, since ν ≤ σ <
ν + 1/p, we get

$$P(\tau - \nu < n/p) = 1 - \varepsilon_2(n).$$

The proof can now be concluded. From inequality
(6) there exists a specific code 𝒞 ⊂ 𝒜_n whose error
probability, averaged over messages, is less than ε_1(n).
Removing the half of the codewords with the highest
error probability, we end up with a set 𝒞' of 2^{B−1}
codewords whose maximum error probability P(ℰ) is
such that

$$P(\mathcal{E}) \leq 2\varepsilon_1(n), \tag{7}$$

and whose delay satisfies

$$d(\mathcal{C}', \varepsilon_2(n)) \leq n/p.$$

Now fix the ratio B/n and substitute A = 2^{βB} in the
definition of ε_1(n) (see (6)). Then, P(ℰ) goes to zero as
B → ∞ whenever

$$\frac{B}{n} < \min\left\{ I(X;Y),\ \frac{I(X;Y) + D(Y\|Y_\star)}{1+\beta} \right\}. \tag{8}$$

Recall that by construction, all the codewords have cost
nE[k(X)](1 + o(1)) as n → ∞. Hence, for any η > 0
and all n large enough

$$K(\mathcal{C}') \leq nE[k(X)](1+\eta). \tag{9}$$

Condition (8) is thus implied by condition

$$\frac{B}{K(\mathcal{C}')} < \min\left\{ \frac{I(X;Y)}{(1+\eta)E[k(X)]},\ \frac{I(X;Y) + D(Y\|Y_\star)}{E[k(X)](1+\eta)(1+\beta)} \right\}. \tag{10}$$

Maximizing over all input distributions and using the
fact that η > 0 is arbitrary proves that C(β, 1), where
C(β, 1) is defined in Theorem 1, is asymptotically
achieved by non-random codes with delay no larger than
n/p with probability approaching one as n → ∞.

Finally, if 1/p is not an integer, it suffices to define
transmission times as

$$t_j = \lfloor j/p \rfloor.$$

This guarantees the same asymptotic performance as for
the case where 1/p is an integer.

Case $\star \notin \mathcal{X}$: Parse the entire sequence $\{1, 2, \ldots, A + n - 1\}$
into consecutive superperiods of size $n/\rho$ (take
$\lfloor n/\rho \rfloor$ if $n/\rho$ is not an integer). The periods of duration $n$
occurring at the end of each superperiod are referred to as
transmission periods. Given $\nu$, the codeword starts being
sent over the first transmission period starting at a time $\ge \nu$.
In particular, if $\nu$ happens over a transmission period,
then the transmitter delays the codeword transmission to
the next superperiod.
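The superperiod schedule can be sketched in a few lines of Python. The sketch assumes that $1/\rho$ is an integer, that time is 1-indexed, and that a transmission period starting exactly at $\nu$ may be used; these conventions, and the function name, are assumptions of the example rather than statements from the text.

```python
def codeword_start(nu, n, rho_inv):
    """Start time of the first transmission period beginning at or after nu.

    Superperiod k covers times (k-1)*T + 1 .. k*T with T = n * rho_inv, and
    its transmission period consists of the last n slots of the superperiod.
    """
    T = n * rho_inv
    k = 1
    while k * T - n + 1 < nu:
        k += 1
    return k * T - n + 1


# n = 4 and 1/rho = 3: transmission periods start at times 9, 21, 33, ...
n, rho_inv = 4, 3
for nu in (1, 9, 10):
    start = codeword_start(nu, n, rho_inv)
    print(nu, start, start + n - 1 - nu)  # arrival, start, time until last symbol
```

Under these conventions the time from $\nu$ to the last codeword symbol never exceeds $n + n/\rho$, the worst case being an arrival just after a transmission period has begun.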

The receiver sequentially samples only the transmission
periods. At the end of a transmission period, the
decoder computes the empirical distributions

$$\hat{P}_{c^n(m), y^n}(\cdot, \cdot)$$

induced by the last output samples $y^n$ and all the
codewords $\{c^n(m)\}$. If there is a unique message $m$ for
which

$$\| \hat{P}_{c^n(m), y^n}(\cdot, \cdot) - P(\cdot) Q(\cdot | \cdot) \| \le 2/\log n,$$

the decoder stops and declares that message $m$ was sent.
If two (or more) codewords $c^n(m)$ and $c^n(m')$ relative
to two different messages $m$ and $m'$ are typical with $y^n$,
the decoder stops and declares one of the corresponding
messages at random. If no codeword is typical with $y^n$,
the decoder waits for the next transmission period to
occur, samples it, and repeats the decoding procedure.
As in the previous case, if at the end of the
last transmission period no message has been declared,
the decoder outputs a random message.
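The typicality test applied at the end of each transmission period can be sketched as follows. The sketch assumes the norm is the $L_1$ distance and the logarithm is natural (the text above does not fix either), and it returns the list of typical messages so that the caller can apply the unique/tie/none rule. All names and the toy channel are hypothetical.

```python
from collections import Counter
from math import log


def joint_type(codeword, outputs):
    """Empirical joint distribution of (input symbol, output symbol) pairs."""
    n = len(codeword)
    counts = Counter(zip(codeword, outputs))
    return {pair: c / n for pair, c in counts.items()}


def is_typical(codeword, outputs, P, Q, threshold):
    """True if the L1 distance between the joint type and P(x)Q(y|x) is small."""
    emp = joint_type(codeword, outputs)
    pairs = set(emp) | {(x, y) for x in P for y in Q[x]}
    dist = sum(
        abs(emp.get((x, y), 0.0) - P.get(x, 0.0) * Q.get(x, {}).get(y, 0.0))
        for x, y in pairs
    )
    return dist <= threshold


def typical_messages(codebook, outputs, P, Q):
    """Messages whose codeword is typical with the outputs (empty: keep waiting)."""
    threshold = 2 / log(len(outputs))
    return [m for m, c in enumerate(codebook) if is_typical(c, outputs, P, Q, threshold)]


# Toy example: two codewords of length 20 over a binary channel with 10% flips.
P = {0: 0.5, 1: 0.5}
Q = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
codebook = [[0, 1] * 10, [1, 0] * 10]
outputs = list(codebook[0])
outputs[0], outputs[1] = 1, 0  # two channel flips
print(typical_messages(codebook, outputs, P, Q))  # [0]
```

A decoder built on this would declare the message when the list has one element, pick at random when it has several, and wait for the next transmission period when it is empty.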
Following the same arguments as for the case $\star \in \mathcal{X}$,
we deduce that (10) also holds in this case and that for
the delay we have

$$P(\tau - \nu \le n + n/\rho) = 1 - \epsilon_2(n)$$

for some $\epsilon_2(n) \to 0$ as $n \to \infty$. To see this, note that
a superperiod has duration $n/\rho$ and that if $\nu$ happens
during a transmission period, then the actual codeword
transmission is delayed to the next transmission period.

■ 

Delay Converse of Theorem 2: We consider the
cases $\star \in \mathcal{X}$ and $\star \notin \mathcal{X}$ separately.
Case $\star \in \mathcal{X}$: Pick some arbitrary $0 < \rho < 1$, $\beta > 0$ such
that $C(\beta, \rho) > 0$, and $0 < \gamma < 1$. Consider a code $\mathcal{C}_B$
with length $n_B$ codewords that achieves rate per unit cost
$\gamma C(\beta, \rho) - \epsilon_B > 0$, maximum error probability at most
$\epsilon_B$, sampling frequency $\rho(\mathcal{C}_B, \epsilon_B) \le \rho + \epsilon_B$, and delay
$d_B = d(\mathcal{C}_B, \epsilon_B)$, for some $\epsilon_B \to 0$ as $B \to \infty$. The sampling
strategy $\mathcal{S}$ is supposed to be non-adaptive, and for the
moment also non-randomized.

Denote by $\mathcal{J}_\gamma$ the event that the decoder samples at
least $\gamma' n^*_B$ samples of the sent codeword (recall that $n^*_B$
refers to the minimal codeword length, see Section III).
Then by the converse of [3, Theorem 1]

$$\mathbb{P}_m(\mathcal{J}_\gamma) = 1 - o(1) \quad (B \to \infty) \qquad (11)$$

for any message $m$, where $\gamma' = \gamma'(\gamma)$ satisfies
$\gamma'(\gamma) > 0$ for any $\gamma > 0$ and $\lim_{\gamma \to 1} \gamma'(\gamma) = 1$.

Further, by our assumption on the error probability
and on the delay (see Definition 4), we have for any
message $m$

$$\mathbb{P}_m(\mathcal{E}_m^c \cap \{\tau - \nu \le d_B - 1\}) \to 1 \quad (B \to \infty),$$

where $\mathcal{E}_m^c$ denotes the successful decoding event. This
implies that for any message $m$

$$\mathbb{P}_m(\{0 \le \tau - \nu \le d_B - 1\}) \to 1 \quad (B \to \infty),$$

since the error probability is bounded away from zero
whenever $\tau < \nu$.
It then follows that

$$\mathbb{P}_m(\{0 \le \tau - \nu \le d_B - 1\} \cap \mathcal{J}_\gamma) = 1 - o(1) \quad (B \to \infty). \qquad (12)$$

Hence, since $\nu$ is uniformly distributed over
$\{1, 2, \ldots, A\}$, for $B$ large enough we
have

$$\mathbb{P}_m(\{0 \le \tau - t \le d_B - 1\} \cap \mathcal{J}_\gamma \mid \nu = t) > 0$$

for at least $(1 - o(1))A$ values of $t \in \{1, 2, \ldots, A\}$. Now,
conditioned on $\{\nu = t\}$, if event

$$\{0 \le \tau - t \le d_B - 1\} \cap \mathcal{J}_\gamma$$

happens (i.e., with non-zero probability), then necessarily
the period $\{t, t + 1, \ldots, t + d_B - 1\}$ contains at least
$\gamma' n^*_B$ sampling times; here we use the fact that $\mathcal{S}$ is
non-randomized.
It then follows that

$$|\mathcal{S}| \ge \frac{(1 - o(1))A}{d_B}\, \gamma' n^*_B. \qquad (13)$$

Now if

$$\rho\, d_B < n^*_B (1 - \epsilon) \qquad (14)$$

for some arbitrary fixed $0 < \epsilon < 1$, then

$$\frac{(1 - o(1))A}{d_B}\, \gamma' n^*_B > \frac{(1 - o(1))\gamma'}{1 - \epsilon}\, \rho A \qquad (15)$$

as $B \to \infty$.

Hence, by taking $\gamma'$ and hence $\gamma$ close enough to 1
and by taking $B$ large enough

$$(1 - o(1))\gamma'/(1 - \epsilon) > 1.$$

Therefore, if (14) holds, from (13) and (15) we get

$$|\mathcal{S}| \ge \rho(1 + \epsilon')A \qquad (16)$$

for $B$ large enough and some $\epsilon' > 0$ such that $\epsilon' \to 0$
as $\epsilon \to 0$. Inequality (16) implies that the sampling
constraint is violated, as we now show.

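Fix an arbitrary $0 < \epsilon'' < 1$. For an arbitrary integer
$1 \le k \le A + n - 1$ and any message $m$

$$\begin{aligned}
\mathbb{P}_m(S_\tau \ge \rho\tau(1 + \epsilon''))
&\ge \mathbb{P}_m(S_\tau \ge \rho\tau(1 + \epsilon'') \mid \tau \ge k)\, \mathbb{P}_m(\tau \ge k) \\
&\ge \mathbb{P}_m(S_k \ge \rho(A + n - 1)(1 + \epsilon'') \mid \tau \ge k)\, \mathbb{P}_m(\tau \ge k) \\
&= \mathbb{P}_m(S_k \ge (1 + \epsilon'')\rho A(1 + o(1)))\, \mathbb{P}_m(\tau \ge k)
\end{aligned} \qquad (17)$$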

where for the second inequality we used the fact that
the sampling times $S_1, S_2, \ldots$ are non-decreasing and
the fact that $\tau \le A + n - 1$. We now show that both
terms

$$\mathbb{P}_m(S_k \ge (1 + \epsilon'')\rho A(1 + o(1)))$$

and

$$\mathbb{P}_m(\tau \ge k)$$

are bounded away from zero in the limit $B \to \infty$, for
an appropriate choice of $k$. This, by (17), implies that

$$\liminf_{B \to \infty} \mathbb{P}_m(S_\tau \ge \rho\tau(1 + \epsilon'')) > 0,$$

i.e., that sampling frequency $\rho$ is not achievable whenever
(14) holds. In other words, to achieve a sampling
frequency $\rho$ it is necessary that delay and codeword
length satisfy

$$d_B \ge \frac{n^*_B}{\rho}(1 - o(1)).$$
Let

$$k = (1 + 2\epsilon'')\rho A. \qquad (18)$$

Since $S_k \ge k$,

$$S_k \ge (1 + 2\epsilon'')\rho A,$$

and so by choosing $\epsilon'' > 0$ small enough we get

$$\mathbb{P}_m(S_k \ge (1 + \epsilon'')\rho A(1 + o(1))) = 1 \qquad (19)$$

for $B$ large enough.

Since $\mathcal{C}_B$ achieves (maximum) error probability $\le \epsilon_B$
we have for any message $m$

$$\begin{aligned}
\epsilon_B &\ge \mathbb{P}_m(\mathcal{E}) \\
&\ge \mathbb{P}_m(\mathcal{E} \mid \tau < k, \nu > k)\, \mathbb{P}_m(\tau < k, \nu > k) \\
&\ge \tfrac{1}{2}\, \mathbb{P}_m(\tau < k \mid \nu > k)\, \mathbb{P}(\nu > k) \\
&= \tfrac{1}{2}\, \mathbb{P}_\star(\tau < k)\, \mathbb{P}(\nu > k).
\end{aligned} \qquad (20)$$

For the third inequality in (20) note that event $\{\tau < k, \nu > k\}$
means that the decoder declares a message
before the actual message even starts being sent. In
this case, the error probability is at least 1/2, since a
message set always contains at least two messages (see
Section III). For the last equality in (20), note that event
$\{\tau < k\}$ depends only on $Y_1^{k-1}$, which are i.i.d. $\sim Q_\star$
when conditioned on $\{\nu > k\}$; here $\mathbb{P}_\star$ denotes the output
distribution under pure noise, i.e., when $Y_1^{A+n-1}$ is an
i.i.d. $Q_\star$ random sequence. Repeating this last change of
measure argument we get

$$\begin{aligned}
\mathbb{P}_m(\tau \ge k) &\ge \mathbb{P}_m(\tau \ge k \mid \nu > k)\, \mathbb{P}(\nu > k) \\
&= \mathbb{P}_\star(\tau \ge k)\, \mathbb{P}(\nu > k) \\
&\ge (1 - 2\epsilon_B/\mathbb{P}(\nu > k))\, \mathbb{P}(\nu > k) \\
&= (1 - o(1))\, \mathbb{P}(\nu > k) \\
&= (1 - \rho(1 + 2\epsilon''))(1 - o(1)) \quad (B \to \infty).
\end{aligned} \qquad (21)$$

The second inequality follows from (20). For the second
and third equality in (21) we use the fact that $\nu$ is
uniformly distributed over $\{1, 2, \ldots, A\}$, hence by (18)

$$\mathbb{P}(\nu > k) = (A - k)/A = 1 - \rho(1 + 2\epsilon'') > 0.$$

Since $1 - \rho(1 + 2\epsilon'') > 0$, we have
$\liminf_{B \to \infty} \mathbb{P}_m(\tau \ge k) > 0$ by (21), yielding the desired claim.

Finally, to see that randomized sampling strategies
cannot achieve a better sampling frequency, note that a
randomized sampling strategy can be viewed as a probability
distribution over deterministic sampling strategies.
Therefore, because the previous analysis holds for any
deterministic sampling strategy, it must also hold for
randomized sampling strategies.
Case $\star \notin \mathcal{X}$: Pick some arbitrary $\epsilon > 0$ and consider
a code $\mathcal{C}_B$ with length $n_B$ codewords that achieves rate
per unit cost $C(\beta, \rho) - \epsilon_B$, error probability $\le \epsilon_B$, delay
$d_B = d(\mathcal{C}_B, \epsilon_B)$, and sampling frequency $\rho(\mathcal{C}_B, \epsilon_B) \le \rho + \epsilon_B$
for some $\epsilon_B \to 0$ as $B \to \infty$. As in the previous case,
without loss of optimality the sampling strategy $\mathcal{S}$ is
supposed to be non-randomized.

Because $\star \notin \mathcal{X}$, we have $k(x) > 0$ for any $x \in \mathcal{X}$
and therefore to achieve the full-sampling asynchronous
capacity per unit cost it is necessary that the codeword
length remains essentially the same as under full sampling.
More specifically, we must have

$$n_B \le n'_B (1 + \eta(\epsilon)) \quad (B \to \infty) \qquad (22)$$

for some $\eta(\epsilon) \to 0$ as $\epsilon \to 0$, where $n'_B$ denotes
the number of sampled codeword positions (recall that
codeword positions are the positions from time $\sigma$ up to
time $\sigma + n_B - 1$). Note that this is in contrast with the
case $\star \in \mathcal{X}$, where the codeword transmission duration
can be expanded by transmitting $\star$ at no cost.
Proceeding as for the case $\star \in \mathcal{X}$, we have

$$\mathbb{P}_m(\{0 \le \tau - \nu \le d_B - 1\} \cap \mathcal{J}_{\gamma(\epsilon)}) = 1 - o(1) \qquad (23)$$

and

$$\mathbb{P}_m(\{0 \le \tau - t \le d_B - 1\} \cap \mathcal{J}_{\gamma(\epsilon)} \mid \nu = t) > 0$$

for any $t \in \mathcal{B}$, where $\mathcal{B}$ is a certain subset of
$\{1, 2, \ldots, A\}$ with $|\mathcal{B}| = (1 - o(1))A$. This means
that for any $t \in \mathcal{B}$ the decoder samples a "block"
$b \subset \mathcal{S}$ of cardinality at least $\gamma n_B$ over the period
$\{t, t + 1, \ldots, t + d_B - 1\}$. Moreover, if we denote by $i(b)$
and $f(b)$ the time position within $\{1, 2, \ldots, A + n - 1\}$
of the first and the last element of $b$, respectively, then
for each block we have $f(b) - i(b) \le n_B$.

Because of the sampling constraint, there are at most

$$N = \frac{\rho A(1 + o(1))}{\gamma n_B}$$

distinct blocks of size $\gamma n_B$. This implies that $d_B$ should
satisfy

$$d_B \ge (\gamma n_B/\rho + \gamma n_B)(1 - o(1)),$$

as we now show. Intuitively, the delay must satisfy this
bound because the codewords must now be blocks of
symbols, so the receiver might as well sample in blocks
of symbols. Then, the sampling constraint means that,
on average, the gap between sampled blocks grows like
$n_B/\rho$. However, if the message arrives at a time $\nu$ close
to the beginning of a block, then in addition to waiting
until the next block, the message must wait for most of
the current block before being transmitted; close to
capacity we cannot afford to miss more than a negligible
portion of the codeword. Therefore, the delay must grow
as $n_B/\rho + n_B$. We formalize this reasoning below.

Suppose, by way of contradiction, that

$$d_B < \gamma n_B (1 + 1/\rho)(1 - \epsilon) \qquad (24)$$

for some $\epsilon > 0$, and assume for the moment that each
block $b$ is composed of $\gamma n_B(1 + o(1))$ elements and
that there are at least $N(1 - o(1))$ distinct blocks.

Define the "occupation" slot of a block as $\gamma n_B$ plus
the time interval until the next block. The average
occupation slot per block is thus

$$\frac{A}{N} = \frac{\gamma n_B}{\rho}(1 - o(1)).$$

Hence, for any $\epsilon' > 0$ there is a set of at least $\eta N$
occupation slots each of size at most

$$\frac{\gamma n_B}{\rho}(1 + \epsilon'),$$

where $\eta = \eta(\epsilon') > 0$ for any $\epsilon' > 0$. Consider such a set
of occupation slots for some $\epsilon' > 0$ which is specified
later, let $b$ be a block belonging to one such slot, and
let $b'$ denote the block coming after $b$ (for reasons that
will soon be obvious, $b$ should not be the rightmost block
within the set). Denote by $i(b)$
and $f(b)$ the time position within $\{1, 2, \ldots, A + n - 1\}$
of the first and the last element of $b$, respectively. Then
for $B$ large enough

$$f(b') - i(b) \ge \gamma n_B \left(1 + \frac{1}{\rho}(1 + \epsilon')\right)$$

and therefore by taking $\epsilon' > 0$ small enough we get

$$f(b') - i(b) - d_B \ge \eta' n_B$$

by (24), where $\eta' = \eta'(\epsilon, \epsilon') > 0$ for any $\epsilon > 0$ and
$\epsilon' > 0$. It then follows that for any $t \in (i(b), i(b) + \eta' n_B]$,
the interval $\{t, t + 1, \ldots, t + d_B - 1\}$ contains neither
block $b$ nor block $b'$ completely. Therefore, conditioned on
$\nu \in (i(b), i(b) + \eta' n_B]$, event

$$\{\tau - \nu \le d_B - 1\} \cap \mathcal{J}_\gamma$$

does not happen. It then follows that if (24) holds for
some $\epsilon > 0$, then

$$\limsup_{B \to \infty} \mathbb{P}_m(\{\tau - \nu \le d_B - 1\} \cap \mathcal{J}_\gamma) \le 1 - \eta\eta' < 1,$$

which contradicts (23).

The above argument assumes that there are $N$ disjoint
blocks of size $\gamma n_B$. If there are fewer and possibly
larger blocks, the arguments easily extend by defining the
blocks $b$ as any subset of $\mathcal{S}$ such that $f(b) - i(b) \le n_B$.

■ 

Proof of Theorem 3: We show that $C(\beta, \rho) = C(\beta, 1)$
for any $\beta > 0$ and $0 < \rho < 1$ and that
$C(\beta, 1)$ can be achieved with codes $\{\mathcal{C}_B\}$ with delay
$d(\mathcal{C}_B, \epsilon_B) = n^*_B(1 + o(1))$ as $B \to \infty$.

Let $P$ be the distribution achieving $C(\beta, 1)$ (see
Theorem 1). We generate $2^B$ codewords of length

$$n - \log(n)$$

as in the proof of Theorem 2 according to distribution
$P$. Each codeword starts with a common preamble that
consists of $\log(n)$ repetitions of a symbol $x$ such that
$Q(\cdot | x) \ne Q(\cdot | \star)$.

For the proposed asymptotically optimal sampling/decoding
strategy, it is convenient to introduce the
following notation. Let $\tilde{Y}_a^b$ denote the random vector
obtained by extracting the components of the output
process $Y_t$ at $t \in [a, b]$ of the form $t = \lceil j/\rho \rceil$ for
non-negative integer $j$. Notice that, for any $t \ge 1$ and $\ell \gg 1$,
$\tilde{Y}_t^{t+\ell}$ contains $\approx \rho\ell$ samples.

The strategy starts in the sparse mode, taking samples
at times $S_j = \lceil j/\rho \rceil$, $j = 1, 2, \ldots$. At each $j$, the receiver
computes the empirical distribution (or type)

$$\hat{P}_{\tilde{Y}_{S_{j - \log(n) + 1}}^{S_j}}$$

of the sampled output in the most recent window of
length $\log(n)$.

If the probability of this type under pure noise is large
enough, i.e., if

$$\mathbb{P}_\star\big(\hat{P}_{\tilde{Y}_{S_{j - \log(n) + 1}}^{S_j}}\big) > \frac{1}{n^2},$$

the mode is kept unchanged and we repeat this test at
the next round $j + 1$.
Instead, if

$$\mathbb{P}_\star\big(\hat{P}_{\tilde{Y}_{S_{j - \log(n) + 1}}^{S_j}}\big) \le \frac{1}{n^2},$$

then we switch to the dense sampling mode, taking
samples continuously for at most $n$ time steps. At each
of these steps the receiver applies the same sequential
typicality decoder as in the proof of Theorem 2, based
on the past $n - \log n$ output samples. If no codeword
is typical with the channel outputs after these $n$ time
steps, sampling is switched back to the sparse mode.
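The two-mode sampling rule can be sketched in Python as follows. This is only an illustration under stated assumptions: $1/\rho$ is an integer, the window holds the last `w` sparse samples (playing the role of $\log(n)$), the switching threshold is passed in explicitly (the text reads it as $1/n^2$), and decoding and early stopping in the dense mode are omitted. All names are hypothetical.

```python
from collections import Counter
from math import factorial, prod


def type_class_prob(window, q_star):
    """Probability, under i.i.d. noise q_star, of observing the type of `window`."""
    counts = Counter(window)
    coef = factorial(len(window))
    for c in counts.values():
        coef //= factorial(c)
    return coef * prod(q_star[y] ** c for y, c in counts.items())


def adaptive_sample_times(outputs, rho_inv, n, w, q_star, threshold):
    """Sampled times (1-indexed) of a two-mode sampler.

    Sparse mode: one sample every rho_inv slots. When the last w sparse
    samples have a type whose noise probability is <= threshold, the next
    n slots are all sampled (dense mode), then sparse sampling resumes.
    """
    times, recent, t = [], [], rho_inv
    while t <= len(outputs):
        times.append(t)
        recent = (recent + [outputs[t - 1]])[-w:]
        if len(recent) == w and type_class_prob(recent, q_star) <= threshold:
            dense_end = min(t + n, len(outputs))
            times.extend(range(t + 1, dense_end + 1))
            # Resume on the sparse grid after the dense burst.
            t = (dense_end // rho_inv + 1) * rho_inv
            recent = []
        else:
            t += rho_inv
    return times


# Noise emits 1 with probability 0.1; three consecutive sampled 1s are atypical.
q_star = {0: 0.9, 1: 0.1}
outputs = [0] * 40
for t in (8, 12, 16):
    outputs[t - 1] = 1
print(adaptive_sample_times(outputs, 4, 6, 3, q_star, 0.01))
```

On pure-noise-looking outputs the sampler stays on the sparse grid and takes a fraction $\rho$ of the samples; the dense burst is triggered only when the recent window looks unlike noise.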

We compute the error probability of the above scheme,
its relative number of samples, and its delay.

For the error probability, a similar analysis as for the
non-adaptive case in the proof of Theorem 2 still applies,
with $\rho n$ being replaced by $n - \log n$. In particular, after
fixing the ratio $B/n$ and thereby imposing a delay linear
in $B$, equation (10) holds with $\rho = 1$.

For the relative number of samples, we now show that

$$\mathbb{P}_m(\tau/S_\tau \ge \rho + \epsilon_B) \to 0 \quad (B \to \infty) \qquad (25)$$

with $\epsilon_B = 1/\mathrm{poly}(B)$, from which we then conclude
that $C(\beta, \rho) \ge C(\beta, 1)$. To do this, it is convenient to
introduce $Z_i$, $1 \le i \le A + n - 1$, which is equal to one
if at time $i$ the receiver switches to the dense mode and
samples the next $n$ channel outputs and equal to zero
otherwise. Then it follows that

$$\tau \le \rho S_\tau + n \sum_{i=1}^{S_\tau} Z_i. \qquad (26)$$

To see this, note that the number of samples involved in
the sparse mode is at most $\rho S_\tau$ and that the number of
samples involved in the dense mode is at most $n \sum_{i=1}^{S_\tau} Z_i$
(it is actually equal to $n \sum_{i=1}^{S_\tau} Z_i$ if we ignore the
boundary discrepancies that we cannot sample beyond
time $A + n - 1$).
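From (26)

$$\begin{aligned}
\mathbb{P}(\tau/S_\tau \ge \rho + \epsilon) &\le \mathbb{P}\Big(n \sum_{i=1}^{S_\tau} Z_i \ge S_\tau \epsilon\Big) \\
&\le \mathbb{P}\Big(n \sum_{i=1}^{S_\tau} Z_i \ge S_\tau \epsilon,\ \nu \le S_\tau \le \nu + 2n - 2\Big) \\
&\quad + \mathbb{P}(S_\tau < \nu \text{ or } S_\tau > \nu + 2n - 2).
\end{aligned} \qquad (27)$$

We now show that the right-hand side of the second
inequality in (27) vanishes as $B \to \infty$.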

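For the first term on the right-hand side of the second
inequality in (27), since the $Z_i$'s are nonnegative

$$\mathbb{P}\Big(n \sum_{i=1}^{S_\tau} Z_i \ge S_\tau \epsilon;\ \nu \le S_\tau \le \nu + 2n - 2\Big) \le \mathbb{P}\Big(n \sum_{i=1}^{\nu + 2n - 2} Z_i \ge \nu\epsilon\Big). \qquad (28)$$

Now, conditioned on $\nu = t$, the $Z_i$'s, $1 \le i \le t - 1$, are
binary random variables distributed according to pure
noise. Hence,

$$\begin{aligned}
\mathbb{P}\Big(n \sum_{i=1}^{t + 2n - 2} Z_i \ge t\epsilon \,\Big|\, \nu = t\Big)
&\le \mathbb{P}_\star\Big(n \sum_{i=1}^{t-1} Z_i \ge t\epsilon - (2n - 1)\Big) \\
&\le \frac{t - 1}{\big(t\epsilon - (2n - 1) - (t - 1)/n^2\big)^2} \\
&= o(1) \quad (t \to \infty)
\end{aligned} \qquad (29)$$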



where the second inequality follows from Chebyshev's
inequality and by noting that for $1 \le i \le t - 1$ we have

$$\mathrm{Var}(Z_i) \le E Z_i \le 1/n^2$$

since the variance of a Bernoulli random variable is at
most its mean which, in turn, is at most $1/n^2$.
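Therefore,

$$\begin{aligned}
\mathbb{P}\Big(n \sum_{i=1}^{\nu + 2n - 1} Z_i \ge \nu\epsilon\Big)
&\le \frac{1}{\sqrt{A}} + \frac{1}{A} \sum_{t = \sqrt{A}}^{A} \mathbb{P}\Big(n \sum_{i=1}^{t + 2n - 1} Z_i \ge t\epsilon \,\Big|\, \nu = t\Big) \\
&= o(1) \quad (B \to \infty)
\end{aligned} \qquad (30)$$

where the last equality follows from (29) and the fact
that $\nu$ is uniformly distributed over $\{1, 2, \ldots, A = e^{\beta B}\}$.
From (28) and (30) we get

$$\mathbb{P}\Big(n \sum_{i=1}^{S_\tau} Z_i \ge S_\tau \epsilon;\ \nu \le S_\tau \le \nu + 2n - 2\Big) = o(1)$$

as $B \to \infty$.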

We now show that

$$\mathbb{P}(S_\tau < \nu \text{ or } S_\tau > \nu + 2n - 1) \to 0 \quad (B \to \infty). \qquad (31)$$

That $\mathbb{P}(S_\tau < \nu) \to 0$ follows from the fact that
$\mathbb{P}_m(\mathcal{E}_2) \to 0$, where $\mathcal{E}_2$ is defined in the proof of
Theorem 2. That $\mathbb{P}(S_\tau > \nu + 2n - 1) \to 0$ follows
from the fact that with probability tending to one the
sampling strategy will change mode over the transmitted
codeword and that the typicality decoder will make a
decision up to time $\nu + n - 1$ with probability tending to
one. This last argument can also be used for the delay
to show that $d(\mathcal{C}_B, \epsilon_B) \le n(1 + o(1))$ for some $\epsilon_B \to 0$.

Finally, by optimizing the input distribution to minimize
delay (see paragraph after Theorem 1) we deduce
that $C(\beta, 1) = C(\beta, \rho)$ and that the capacity per unit cost
is achievable with delay $n^*_B(1 + o(1))$. ■

We end this section with a few words concerning the
remark at the end of Section III. To prove the claim
it suffices to slightly modify the achievability schemes
yielding Theorems 2 and 3 to make them universal at
the decoder.

The first modification is needed to estimate the pure
noise distribution $Q_\star$ with a negligible fraction of channel
outputs. An estimate of this distribution is obtained
by sampling the first $\sqrt{A}$ output symbols. At the end of
this estimation phase, the receiver declares the pure noise
distribution as being equal to $\hat{P}_{Y^{\sqrt{A}}}$. Note that since $\nu$ is
uniformly distributed over $\{1, 2, \ldots, A\}$ we have

$$\mathbb{P}\big(\| \hat{P}_{Y^{\sqrt{A}}} - Q_\star \|_1 \le \epsilon_B\big) \to 1$$

as $B \to \infty$, for some $\epsilon_B \to 0$. Note also that this estimation
phase requires a negligible amount of sampling,
i.e., sublinear in $A$.
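The estimation phase amounts to forming an empirical distribution from the first $\lfloor\sqrt{A}\rfloor$ outputs and comparing it with the true noise law in $L_1$ distance. A minimal Python sketch, with hypothetical function names and a deterministic toy output sequence standing in for channel noise:

```python
from collections import Counter
from math import isqrt


def estimate_noise(outputs, A):
    """Empirical distribution of the first floor(sqrt(A)) output symbols."""
    m = isqrt(A)
    counts = Counter(outputs[:m])
    return {y: c / m for y, c in counts.items()}


def l1_distance(p, q):
    """L1 distance between two distributions given as dictionaries."""
    return sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in set(p) | set(q))


# Toy sequence with A = 100: the first 10 outputs contain nine 0s and one 1.
A = 100
outputs = ([0] * 9 + [1]) * 10
estimate = estimate_noise(outputs, A)
print(estimate, l1_distance(estimate, {0: 0.9, 1: 0.1}))
```

Only $\sqrt{A}$ of the roughly $A$ available outputs are used, so the sampling overhead of this phase is sublinear in $A$.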

The second modification concerns the typicality decoder,
which is replaced by an MMI (Maximum Mutual
Information) decoder (see [10, Chapter 2]).

It is straightforward to verify that the modified
schemes indeed achieve the asynchronous capacity per
unit cost. The formal arguments are similar to those used
in [3, Proof of Theorem 2] (see also [5, Theorem 3],
which proves the claim under full sampling and unit
input cost) and are thus omitted.